SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

Detailed bibliography
Published in: 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1-13
Main authors: Niu, Changan; Li, Chuanyi; Ng, Vincent; Ge, Jidong; Huang, Liguo; Luo, Bin
Format: Conference paper
Language: English
Published: ACM, 1 May 2022
ISSN: 1558-1225
Abstract Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.
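Note: the following is a minimal, illustrative PyTorch sketch of the sequence-to-sequence pre-training setup described in the abstract, not the authors' released code. The vocabulary size, model dimensions, objective details, and toy inputs are assumptions, and the encoder input that SPT-Code builds from source code, code structure, and a natural-language component is collapsed here into a single token sequence (src_ids).

import torch
import torch.nn as nn

# Illustrative hyperparameters (assumptions, not the paper's settings).
VOCAB_SIZE, PAD_ID, D_MODEL = 32000, 0, 512

embed = nn.Embedding(VOCAB_SIZE, D_MODEL, padding_idx=PAD_ID)
model = nn.Transformer(d_model=D_MODEL, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def seq2seq_step(src_ids, tgt_ids):
    # Teacher forcing: the decoder sees the target shifted right and
    # predicts the next token, so both encoder and decoder get gradients.
    tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]
    causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))
    hidden = model(embed(src_ids), embed(tgt_in), tgt_mask=causal_mask)
    logits = lm_head(hidden)
    return loss_fn(logits.reshape(-1, VOCAB_SIZE), tgt_out.reshape(-1))

# Toy batch: random token ids standing in for a (corrupted) input sequence
# and its reconstruction target.
src = torch.randint(1, VOCAB_SIZE, (2, 64))
tgt = torch.randint(1, VOCAB_SIZE, (2, 32))
print(seq2seq_step(src, tgt).item())

The point of the sketch is only that the pre-training loss updates both the encoder and the decoder, which is the property the abstract argues encoder-only pre-training lacks for generation tasks.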
Author Li, Chuanyi
Luo, Bin
Ng, Vincent
Ge, Jidong
Huang, Liguo
Niu, Changan
Author_xml – sequence: 1
  givenname: Changan
  surname: Niu
  fullname: Niu, Changan
  email: nougatca@qq.com
  organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China
– sequence: 2
  givenname: Chuanyi
  surname: Li
  fullname: Li, Chuanyi
  email: lcy@nju.edu.cn
  organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China
– sequence: 3
  givenname: Vincent
  surname: Ng
  fullname: Ng, Vincent
  email: vince@hlt.utdallas.edu
  organization: Human Language Technology Research Institute University of Texas at Dallas,Richardson,Texas,USA
– sequence: 4
  givenname: Jidong
  surname: Ge
  fullname: Ge, Jidong
  email: gjd@nju.edu.cn
  organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China
– sequence: 5
  givenname: Liguo
  surname: Huang
  fullname: Huang, Liguo
  email: lghuang@lyle.smu.edu
  organization: Southern Methodist University,Dept. of Computer Science,Dallas,Texas,USA
– sequence: 6
  givenname: Bin
  surname: Luo
  fullname: Luo, Bin
  email: luobin@nju.edu.cn
  organization: Nanjing University,State Key Laboratory for Novel Software Technology,Nanjing,China
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1145/3510003.3510096
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Computer Science
EISBN 9781450392211
1450392210
EISSN 1558-1225
EndPage 13
ExternalDocumentID 9793930
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61802167,61802095
  funderid: 10.13039/501100001809
– fundername: Natural Science Foundation of Jiangsu Province, China
  grantid: BK20201250
  funderid: 10.13039/501100004608
– fundername: NSF
  grantid: 2034508
  funderid: 10.13039/100000001
ISICitedReferencesCount 72
IsPeerReviewed false
IsScholarly true
Language English
PageCount 13
ParticipantIDs ieee_primary_9793930
PublicationCentury 2000
PublicationDate 2022-May
PublicationDateYYYYMMDD 2022-05-01
PublicationDate_xml – month: 05
  year: 2022
  text: 2022-May
PublicationDecade 2020
PublicationTitle 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
PublicationTitleAbbrev ICSE
PublicationYear 2022
Publisher ACM
Publisher_xml – name: ACM
SourceID ieee
SourceType Publisher
StartPage 01
SubjectTerms code representation learning
Codes
Computer architecture
Decoding
Natural languages
pre-training
Representation learning
sequence-to-sequence
Task analysis
Transformers
Title SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations
URI https://ieeexplore.ieee.org/document/9793930