Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization

Detailed bibliography
Published in: Proceedings (IEEE/ACM International Conference on Software Engineering Companion. Online), pp. 328-329
Main authors: Zhu, Tingwei; Li, Zhong; Pan, Minxue; Shi, Chaoxuan; Zhang, Tian; Pei, Yu; Li, Xuandong
Medium: Conference paper
Language: English
Published: IEEE, 1 May 2023
ISSN: 2574-1934
Abstract
Code summarization refers to the procedure of creating short descriptions that outline the semantics of source code snippets. Existing code summarization approaches can be broadly classified into Information Retrieval (IR)-based and Deep Learning (DL)-based approaches. However, their effectiveness, and especially their strengths and weaknesses, remain largely understudied. Existing evaluations use different benchmarks and metrics, making performance comparisons of these approaches susceptible to bias and potentially yielding misleading results. For example, the DL-based approaches typically show better code summarization performance in their original papers [1], [2]. However, Gros et al. [3] report that a naive IR approach can achieve performance comparable to (or even better than) the DL-based ones. In addition, some recent work [4], [5] suggests that incorporating IR techniques can improve the DL-based approaches. To further advance code summarization techniques, it is critical to understand how IR-based and DL-based approaches perform on different datasets and in terms of different metrics. Prior works have studied some aspects of code summarization, such as the factors affecting performance evaluation [6] and the importance of data preprocessing [7]. In this paper, we focus on studying the IR-based and DL-based code summarization approaches to enhance the understanding and design of more advanced techniques. We first compare the IR-based and DL-based approaches under the same experimental settings and benchmarks, then study their strengths and limitations through quantitative and qualitative analyses. Finally, we propose a simple yet effective strategy that combines IR and DL to further improve code summarization. Four IR-based approaches and two DL-based approaches are investigated, selected for representativeness and diversity.

For IR-based approaches, we select three BM25-based approaches (i.e., BM25-spl, BM25-ast, and BM25-alpha) and one nearest neighbor-based approach, NNGen [8], all of which are often used as baselines in prior work. They retrieve the most similar code from the database and directly output the corresponding summary. The BM25-based approaches are implemented with Lucene [9]. Taking different forms of code as input, BM25-spl splits CamelCase and snake_case identifiers in the original source code tokens, BM25-ast obtains sequence representations via pre-order Abstract Syntax Tree (AST) traversal, and BM25-alpha keeps only the alphabetic tokens in the code. For DL-based approaches, we choose the state-of-the-art pre-trained model PLBART [2] and the trained-from-scratch model SiT [1]. We adopt four widely used Java datasets, namely TLC [10], CSN [11], HDC [12], and FCM [13], as our subject datasets. TLC and HDC are method-split datasets, where methods in the same project are randomly split into training/validation/test sets. CSN and FCM are project-split datasets, where examples from the same project appear in only one partition. We further process the four datasets to build cleaner versions by removing examples that have syntax errors, empty method bodies, or overly long or short sequences, and by removing duplicate examples from the validation and test sets. To comprehensively and systematically evaluate the performance of the code summarization approaches, we adopt three widely used metrics, i.e., BLEU (both C-BLEU and S-BLEU), ROUGE, and METEOR.
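The retrieval pipeline described in the abstract (subtoken splitting followed by BM25 ranking, with the retrieved example's summary emitted verbatim) can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's Lucene-backed implementation: the toy code/summary database, the query snippet, and the BM25 parameters (k1 = 1.2, b = 0.75) are all illustrative assumptions.

```python
import math
import re
from collections import Counter

def subtokenize(code: str) -> list[str]:
    """Split identifiers on snake_case and CamelCase, as BM25-spl does."""
    words = re.findall(r"[A-Za-z]+", code)  # keep alphabetic runs only
    parts = []
    for w in words:
        for piece in w.split("_"):
            # CamelCase split: "getFileName" -> ["get", "File", "Name"]
            parts += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", piece)
    return [p.lower() for p in parts]

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Plain BM25: score each tokenized document in `corpus` against `query`."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    df = Counter(t for d in corpus for t in set(d))  # document frequencies
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# Toy database: (code, summary) pairs stand in for the training set.
database = [
    ("public int getFileSize(File f) { return f.length(); }",
     "returns the size of a file"),
    ("void sort_list(List<Integer> xs) { Collections.sort(xs); }",
     "sorts a list in place"),
]
corpus = [subtokenize(code) for code, _ in database]
query = subtokenize("long fileSize(Path p) { return Files.size(p); }")
scores = bm25_scores(query, corpus)
best = max(range(len(scores)), key=scores.__getitem__)
print(database[best][1])  # prints: returns the size of a file
```

Because the retrieved summary is output as-is, IR approaches of this kind succeed exactly when the database contains near-duplicate code, which is one reason method-split and project-split datasets yield such different comparisons.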
Authors and affiliations:
1. Zhu, Tingwei (Nanjing University, Nanjing, China)
2. Li, Zhong (Nanjing University, Nanjing, China)
3. Pan, Minxue (Nanjing University, Nanjing, China)
4. Shi, Chaoxuan (Nanjing University, Nanjing, China)
5. Zhang, Tian (Nanjing University, Nanjing, China)
6. Pei, Yu (The Hong Kong Polytechnic University, Hong Kong, China)
7. Li, Xuandong (Nanjing University, Nanjing, China)
CODEN: IEEPAD
Content type: Conference proceeding
DOI: 10.1109/ICSE-Companion58688.2023.00091
EISBN: 9798350322637
EISSN: 2574-1934
End page: 329
IEEE Xplore document ID: 10172879
Genre: original research
Page count: 2
Publication date: May 2023
Publication title: Proceedings (IEEE/ACM International Conference on Software Engineering Companion. Online)
Publication title abbreviation: ICSE-COMPANION
Publication year: 2023
Publisher: IEEE
Start page: 328
Subject terms: Code summarization; Codes; Deep learning; empirical study; information retrieval; Java; Performance evaluation; Semantics; Source coding; Syntactics
Title: Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization
URI: https://ieeexplore.ieee.org/document/10172879