SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations
| Published in: | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 01-13 |
|---|---|
| Main Authors: | Niu, Changan; Li, Chuanyi; Ng, Vincent; Ge, Jidong; Huang, Liguo; Luo, Bin |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | ACM, 01.05.2022 |
| Subjects: | code representation learning; Codes; Computer architecture; Decoding; Natural languages; pre-training; Representation learning; sequence-to-sequence; Task analysis; Transformers |
| ISSN: | 1558-1225 |
| Online Access: | Get full text |
| Abstract | Recent years have seen the successful application of large pretrained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pretraining tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pretraining tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning. |
|---|---|
| Author | Li, Chuanyi; Luo, Bin; Ng, Vincent; Ge, Jidong; Huang, Liguo; Niu, Changan |
| Author_xml | 1. Niu, Changan (nougatca@qq.com), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 2. Li, Chuanyi (lcy@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 3. Ng, Vincent (vince@hlt.utdallas.edu), Human Language Technology Research Institute, University of Texas at Dallas, Richardson, Texas, USA; 4. Ge, Jidong (gjd@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 5. Huang, Liguo (lghuang@lyle.smu.edu), Southern Methodist University, Dept. of Computer Science, Dallas, Texas, USA; 6. Luo, Bin (luobin@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1145/3510003.3510096 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| Discipline | Computer Science |
| EISBN | 9781450392211; 1450392210 |
| EISSN | 1558-1225 |
| EndPage | 13 |
| ExternalDocumentID | 9793930 |
| Genre | orig-research |
| GrantInformation_xml | National Natural Science Foundation of China, grants 61802167 and 61802095 (funder ID 10.13039/501100001809); Natural Science Foundation of Jiangsu Province, China, grant BK20201250 (funder ID 10.13039/501100004608); NSF, grant 2034508 (funder ID 10.13039/100000001) |
| ISICitedReferencesCount | 72 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-May |
| PublicationDateYYYYMMDD | 2022-05-01 |
| PublicationDecade | 2020 |
| PublicationTitle | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) |
| PublicationTitleAbbrev | ICSE |
| PublicationYear | 2022 |
| Publisher | ACM |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 01 |
| SubjectTerms | code representation learning; Codes; Computer architecture; Decoding; Natural languages; pre-training; Representation learning; sequence-to-sequence; Task analysis; Transformers |
| Title | SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations |
| URI | https://ieeexplore.ieee.org/document/9793930 |